Agentic Browser

Documentation


Website Analysis Integration

Table of Contents#

  1. Introduction

  2. Project Structure

  3. Core Components

  4. Architecture Overview

  5. Detailed Component Analysis

  6. Dependency Analysis

  7. Performance Considerations

  8. Troubleshooting Guide

  9. Conclusion

  10. Appendices

Introduction#

This document explains the website analysis service integration: how HTML is converted to markdown, how request metadata is processed, and how advanced scraping capabilities are implemented. It covers website validation mechanisms, content extraction patterns, and data transformation workflows, and documents the super scraper functionality for intelligent content extraction, DOM filtering, and structured data processing, with practical examples of analysis workflows, content processing patterns, and validation strategies. Finally, it addresses ethical scraping, rate limiting, performance optimization, and troubleshooting for common issues.

Project Structure#

The website analysis pipeline spans several layers:

  • Routers define HTTP endpoints for website analysis and validation.

  • Services orchestrate content fetching, conversion, and LLM-driven answer generation.

  • Tools implement HTML-to-markdown conversion, server-side markdown fetching, and a super scraper for advanced content extraction.

  • Prompts define the instruction templates and chains used to synthesize answers.

  • Configuration manages environment variables and logging.

```mermaid
graph TB
    subgraph "Routers"
        R1["website.py<br/>POST /"]
        R2["website_validator.py<br/>POST /validate-website"]
    end
    subgraph "Services"
        S1["website_service.py<br/>WebsiteService"]
        S2["website_validator_service.py<br/>validate_website"]
    end
    subgraph "Tools"
        T1["website_context/__init__.py<br/>Exports"]
        T2["html_md.py<br/>return_html_md"]
        T3["request_md.py<br/>return_markdown"]
        T4["super_scraper.py<br/>clean_response"]
    end
    subgraph "Prompts"
        P1["website.py<br/>Prompt + Chain"]
        P2["prompt_injection_validator.py<br/>Validation Template"]
    end
    subgraph "Models"
        M1["requests/website.py<br/>WebsiteRequest"]
        M2["response/website.py<br/>WebsiteResponse"]
    end
    subgraph "Config"
        C1["core/config.py<br/>Environment & Logging"]
    end
    R1 --> S1
    R2 --> S2
    S1 --> T1
    T1 --> T2
    T1 --> T3
    S1 --> P1
    S2 --> P2
    S1 --> M1
    S1 --> M2
    S2 --> M1
    S2 --> M2
    S1 --> C1
    S2 --> C1
```


Core Components#

  • WebsiteService orchestrates the end-to-end website analysis:

    • Fetches server-side markdown via a Jina AI proxy.

    • Converts client-provided HTML to markdown.

    • Builds a prompt chain with server and client contexts plus optional chat history.

    • Optionally integrates an attached file via the Google GenAI SDK.

    • Returns a synthesized answer from the LLM.

  • WebsiteValidatorService validates HTML content by converting it to markdown and checking for prompt injection risks using a dedicated prompt and LLM.

  • Routers expose endpoints for website analysis and validation with request/response models.

  • Tools implement:

    • HTML-to-markdown conversion.

    • Server-side markdown fetching via Jina AI.

    • Super scraper for advanced content extraction with DOM filtering and asynchronous loading.

  • Prompts define the instruction templates and chains for answer synthesis and validation.

  • Configuration manages environment variables and logging.


Architecture Overview#

The system follows a layered architecture:

  • HTTP layer: FastAPI routers accept requests and delegate to services.

  • Service layer: WebsiteService and WebsiteValidatorService encapsulate business logic.

  • Tool layer: Utilities for HTML/markdown conversion and content fetching.

  • Prompt layer: Instruction templates and chains for LLM interactions.

  • Configuration layer: Environment and logging setup.

```mermaid
sequenceDiagram
    participant Client as "Client"
    participant Router as "website.py"
    participant Service as "WebsiteService"
    participant Tools as "Tools"
    participant Prompt as "Prompts"
    participant LLM as "LLM"
    Client->>Router : POST "/" with WebsiteRequest
    Router->>Service : generate_answer(url, question, chat_history, client_html)
    Service->>Tools : markdown_fetcher(url)
    Tools-->>Service : server_markdown
    Service->>Tools : html_md_convertor(client_html)
    Tools-->>Service : client_markdown
    Service->>Prompt : get_answer(chain, question, server_markdown, chat_history, client_markdown)
    Prompt->>LLM : invoke(prompt + inputs)
    LLM-->>Prompt : answer
    Prompt-->>Service : answer
    Service-->>Router : answer
    Router-->>Client : WebsiteResponse
```


Detailed Component Analysis#

WebsiteService#

WebsiteService coordinates:

  • Server-side markdown retrieval via Jina AI.

  • Client-side HTML-to-markdown conversion.

  • Chat history formatting.

  • Optional attached file processing via Google GenAI SDK.

  • LLM answer synthesis using a composed prompt chain.

Key processing logic:

  • Validates presence of required fields.

  • Fetches server markdown and logs length.

  • Converts client HTML to markdown when provided.

  • Formats chat history into a string.

  • Handles attached file upload and generation via Google GenAI if present.

  • Falls back to LLM-based answer synthesis otherwise.
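The chat history formatting step can be sketched as a small helper. This is an illustrative implementation, not the service's actual code; the message shape (`role`/`content` dicts) is an assumption based on the request model.

```python
from typing import Dict, List, Optional

def format_chat_history(chat_history: Optional[List[Dict[str, str]]]) -> str:
    """Flatten a list of {role, content} messages into a prompt-ready string.

    Hypothetical helper: the real service formats history before injecting
    it into the prompt chain; the exact format may differ.
    """
    if not chat_history:
        return ""
    return "\n".join(
        f"{message.get('role', 'user')}: {message.get('content', '')}"
        for message in chat_history
    )
```

An empty or missing history yields an empty string, so the prompt template can always receive the `chat_history` variable unconditionally.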

```mermaid
flowchart TD
    Start(["generate_answer Entry"]) --> Validate["Validate url and question"]
    Validate --> FetchServer["Fetch server markdown via Jina AI"]
    FetchServer --> ConvertClient{"client_html provided?"}
    ConvertClient --> |Yes| ToMarkdown["Convert client HTML to markdown"]
    ConvertClient --> |No| SkipClient["Skip client markdown"]
    ToMarkdown --> BuildHistory["Format chat history"]
    SkipClient --> BuildHistory
    BuildHistory --> Attached{"attached_file_path provided?"}
    Attached --> |Yes| UploadFile["Upload file via Google GenAI SDK"]
    UploadFile --> GenerateAnswer["Generate content with LLM"]
    Attached --> |No| UseChain["Use prompt chain to get answer"]
    GenerateAnswer --> Return["Return answer"]
    UseChain --> Return
    Return --> End(["Exit"])
```


WebsiteValidatorService#

WebsiteValidatorService performs:

  • HTML-to-markdown conversion.

  • Prompt injection risk assessment using a dedicated prompt template and LLM.

  • Boolean safety determination based on model output.
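The final step, reducing the model's free-text verdict to a boolean, might look like the following. This is a sketch under the assumption that the validator prompt asks the model to answer "true" or "false"; the actual parsing logic may differ.

```python
def parse_is_safe(llm_output: str) -> bool:
    """Reduce the validator model's free-text verdict to a boolean.

    Assumption: the prompt instructs the model to reply "true" (safe) or
    "false" (injection risk). Anything that does not clearly start with
    "true" is treated as unsafe, failing closed.
    """
    return llm_output.strip().lower().startswith("true")
```

Failing closed (unrecognized output means unsafe) is the conservative choice for an injection check.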

```mermaid
sequenceDiagram
    participant Client as "Client"
    participant Router as "website_validator.py"
    participant Service as "validate_website"
    participant Tools as "html_md.py"
    participant Prompt as "prompt_injection_validator.py"
    participant LLM as "LLM"
    Client->>Router : POST "/validate-website" with WebsiteValidatorRequest
    Router->>Service : validate_website(request)
    Service->>Tools : return_html_md(html)
    Tools-->>Service : markdown
    Service->>Prompt : create PromptTemplate + LLM chain
    Service->>LLM : invoke({markdown_text})
    LLM-->>Service : result
    Service->>Service : parse result to is_safe
    Service-->>Router : WebsiteValidatorResponse
    Router-->>Client : WebsiteValidatorResponse
```


Tools: HTML to Markdown and Server-Side Fetching#

  • HTML-to-Markdown converter uses BeautifulSoup and html2text to normalize and convert HTML bodies to markdown.

  • Server-side markdown fetcher uses a Jina AI proxy to retrieve clean markdown from URLs.

  • Super scraper leverages WebBaseLoader with BeautifulSoup filters and asynchronous loading to extract structured content.

```mermaid
graph LR
    A["HTML Input"] --> B["BeautifulSoup Parser"]
    B --> C["Normalize Body"]
    C --> D["html2text Conversion"]
    D --> E["Markdown Output"]
    F["URL Input"] --> G["Jina AI Proxy"]
    G --> H["Clean Markdown Output"]
    I["URL Input"] --> J["WebBaseLoader"]
    J --> K["SoupStrainer Filter"]
    K --> L["Async Load Documents"]
    L --> M["Structured Document Output"]
```
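The server-side fetch path can be sketched as follows, assuming the service uses Jina AI's public Reader endpoint, which returns a page as clean markdown when the URL is prefixed with `https://r.jina.ai/`. The function name `markdown_fetcher` comes from the dependency diagram; the timeout value is illustrative.

```python
import requests  # third-party: requests

JINA_READER_PREFIX = "https://r.jina.ai/"

def proxied_url(url: str) -> str:
    """Build the Jina Reader URL that serves a page as clean markdown."""
    return JINA_READER_PREFIX + url

def markdown_fetcher(url: str, timeout: float = 20.0) -> str:
    """Fetch server-side markdown for a URL via the Jina Reader proxy."""
    response = requests.get(proxied_url(url), timeout=timeout)
    response.raise_for_status()  # surface 4xx/5xx from the proxy
    return response.text
```

Setting an explicit timeout matters here; the Performance Considerations section below returns to this point.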

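The super scraper's DOM filtering builds on `SoupStrainer`. The idea can be shown with plain BeautifulSoup, independent of the WebBaseLoader wrapper; the tag list below is an assumption for illustration, not the filter the service actually configures.

```python
from bs4 import BeautifulSoup, SoupStrainer  # third-party: beautifulsoup4

def extract_main_content(html: str) -> str:
    """Parse only content-bearing tags; nav, footers, and scripts are
    never even built into the tree, which also reduces parsing overhead."""
    only_content = SoupStrainer(["article", "main", "h1", "h2", "p"])
    soup = BeautifulSoup(html, "html.parser", parse_only=only_content)
    return soup.get_text(separator="\n", strip=True)
```

Because `parse_only` filters at parse time rather than after the fact, large pages with heavy chrome are cheaper to process.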

Prompts and Chains#

  • The website prompt defines a two-context synthesis strategy: server-fetched markdown and client-rendered markdown, with guidelines for summaries, structure, links/media, code, metadata, data analysis, and formatting.

  • A runnable chain composes the prompt with the LLM client and an output parser.

  • The validator prompt checks for prompt injection attempts and returns a boolean safety signal.

```mermaid
classDiagram
    class PromptTemplate {
        +template : string
        +input_variables : list
    }
    class RunnableLambda {
        +invoke(data) any
    }
    class RunnableParallel {
        +invoke(inputs) dict
    }
    class StrOutputParser {
        +parse(output) string
    }
    PromptTemplate <.. RunnableLambda : "used by"
    RunnableParallel --> PromptTemplate : "feeds"
    PromptTemplate --> StrOutputParser : "outputs"
```


Request/Response Models#

  • WebsiteRequest includes URL, question, optional chat history, optional client HTML, and optional attached file path.

  • WebsiteResponse wraps the generated answer.
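In Pydantic terms (the usual modeling layer for FastAPI routers), the two models would look roughly like this; field names follow the class diagram, while the exact types and defaults are assumptions.

```python
from typing import Dict, List, Optional
from pydantic import BaseModel  # third-party: pydantic

class WebsiteRequest(BaseModel):
    url: str
    question: str
    chat_history: Optional[List[Dict[str, str]]] = None
    client_html: Optional[str] = None
    attached_file_path: Optional[str] = None

class WebsiteResponse(BaseModel):
    answer: str
```

With this shape, FastAPI rejects payloads missing `url` or `question` with a validation error before the service layer runs.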

```mermaid
classDiagram
    class WebsiteRequest {
        +string url
        +string question
        +dict[] chat_history
        +string client_html
        +string attached_file_path
    }
    class WebsiteResponse {
        +string answer
    }
```


MCP Server Integration#

The MCP server exposes tools for website analysis:

  • website.fetch_markdown: Fetches markdown content for a given URL via a Jina proxy.

  • website.html_to_md: Converts raw HTML to markdown.

```mermaid
sequenceDiagram
    participant Client as "MCP Client"
    participant MCP as "MCP Server"
    Client->>MCP : website.fetch_markdown({url})
    MCP-->>Client : markdown content
    Client->>MCP : website.html_to_md({html})
    MCP-->>Client : markdown content
```


Dependency Analysis#

  • WebsiteService depends on:

    • Tools for markdown fetching and HTML conversion.

    • Prompts for constructing the answer chain.

    • Configuration for logging.

  • WebsiteValidatorService depends on:

    • Tools for HTML-to-markdown conversion.

    • Validator prompt and LLM for safety assessment.

  • Routers depend on:

    • Models for request/response validation.

    • Services for business logic.

```mermaid
graph TB
    WS["WebsiteService"] --> TMF["markdown_fetcher"]
    WS --> HMC["html_md_convertor"]
    WS --> PROMPT["website.py Prompt Chain"]
    WS --> CFG["config.py Logger"]
    WV["WebsiteValidatorService"] --> HMC
    WV --> VPROMPT["prompt_injection_validator.py"]
    WR["website.py Router"] --> WS
    WVR["website_validator.py Router"] --> WV
    WR --> WM["WebsiteRequest/Response Models"]
    WVR --> WM
```


Performance Considerations#

  • Asynchronous loading: The super scraper uses asynchronous document loading to improve throughput when fetching multiple pages.

  • Selective parsing: BeautifulSoup filters limit parsing to relevant DOM sections, reducing overhead.

  • Caching and reuse: Reuse server-fetched markdown and client-provided markdown to avoid redundant conversions.

  • Rate limiting and retries: Integrate retry logic and backoff when calling external services like Jina AI and Google GenAI.

  • Timeout configuration: Set explicit timeouts for network requests to prevent long blocking operations.

  • Chunking and pagination: For very large pages, consider chunking content before passing to the LLM to manage token limits.

  • Environment tuning: Adjust logging levels and environment variables for production deployments to minimize overhead.
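The retry-with-backoff recommendation can be sketched as a small stdlib-only wrapper; the attempt count and delays below are illustrative defaults, not values the service prescribes.

```python
import random
import time

def fetch_with_retry(fetch, url, attempts=4, base_delay=0.5):
    """Call fetch(url), retrying failures with exponential backoff plus jitter."""
    for attempt in range(attempts):
        try:
            return fetch(url)
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: surface the error to the caller
            # 0.5s, 1s, 2s, ... with a little jitter to avoid thundering herds
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
            time.sleep(delay)
```

Wrapping the Jina AI and Google GenAI calls this way smooths over transient failures without hiding persistent ones.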


Troubleshooting Guide#

Common issues and resolutions:

  • HTTP 400/500 errors from website router:

    • Ensure URL and question are provided in the request payload.

    • Check service logs for detailed error messages.

  • Empty or malformed markdown:

    • Verify the URL resolves correctly and returns HTML.

    • Confirm client HTML is well-formed when passed for conversion.

  • Prompt injection validation failures:

    • Review the validator response and sanitize HTML accordingly.

    • Consider additional sanitization steps before conversion.

  • Google GenAI file processing errors:

    • Confirm API keys are configured and accessible.

    • Validate the file path and permissions.

  • Network timeouts or rate limits:

    • Add retry logic with exponential backoff.

    • Monitor external service availability and adjust timeouts.


Conclusion#

The website analysis integration combines robust content fetching, intelligent HTML-to-markdown conversion, and LLM-driven synthesis to deliver accurate answers from web pages. Validation ensures safety against prompt injection, while advanced scraping tools enable structured content extraction. By following the outlined workflows, patterns, and best practices, teams can deploy reliable, ethical, and high-performance web analysis capabilities.


Appendices#

Example Workflows#

  • Basic website analysis:

    • Client posts WebsiteRequest to the website router.

    • Service fetches server markdown, optionally converts client HTML, builds the prompt chain, and returns an answer.

  • Website validation:

    • Client posts WebsiteValidatorRequest to the validator router.

    • Service converts HTML to markdown and runs the validator prompt; returns a safety decision.

  • Super scraper usage:

    • Invoke the super scraper to asynchronously load and filter content from a URL, returning a structured document for downstream processing.
